DMDD: A Large-Scale Dataset for Dataset Mentions Detection
Abstract
The recognition of dataset names is a critical task for automatic information extraction in scientific literature, enabling researchers to understand and identify research opportunities. However, existing corpora for dataset mention detection are limited in size and naming diversity. In this paper, we introduce the Dataset Mentions Detection Dataset (DMDD), the largest publicly available corpus for this task. DMDD consists of the DMDD main corpus, comprising 31,219 scientific articles with over 449,000 dataset mentions weakly annotated in the format of in-text spans, and an evaluation set, which comprises 450 scientific articles manually annotated for evaluation purposes. We use DMDD to establish baseline performance for dataset mention detection and linking. By analyzing the performance of various models on DMDD, we are able to identify open problems in dataset mention detection. We invite the community to use our dataset as a challenge to develop novel dataset mention detection models.
Similar resources
Jacquard: A Large Scale Dataset for Robotic Grasp Detection
Grasping skill is a major ability that a wide number of real-life applications require for robotisation. State-of-the-art robotic grasping methods perform prediction of object grasp locations based on deep neural networks which require a huge amount of labeled data for training and prove impracticable in robotics. In this paper, we propose to generate a large scale synthetic dataset with ground tr...
VoxCeleb: A Large-Scale Speaker Identification Dataset
Most existing datasets for speaker identification contain samples obtained under quite constrained conditions, and are usually hand-annotated, hence limited in size. The goal of this paper is to generate a large scale text-independent speaker identification dataset collected ‘in the wild’. We make two contributions. First, we propose a fully automated pipeline based on computer vision technique...
A Diverse Large-scale Dataset for Evaluating Rebroadcast Attacks
We describe the acquisition of a large, diverse set of rebroadcast images captured by a screen-grab, scanning a printed photo, or rephotographing a displayed or a printed photo. This dataset consists of 14,500 rebroadcast images captured from a diverse set of devices: 234 displays, 173 scanners, 282 printers, and 180 recapture cameras. The diversity of this dataset, across devices and types of ...
A Large-scale Dataset and Benchmark for Similar Trademark Retrieval
Trademark retrieval (TR) has become an important yet challenging problem due to an ever increasing trend in trademark applications and infringement incidents. There have been many promising attempts at the TR problem, which, however, proved impracticable since they were evaluated with limited and mostly trivial datasets. In this paper, we provide a large-scale dataset with benchmark queries with...
Large-scale Multiview 3D Hand Pose Dataset
Accurate hand pose estimation at joint level has several uses in human-robot interaction, user interfacing and virtual reality applications. Yet, it is currently not a solved problem. Novel deep learning techniques could bring a great improvement on this matter, but they need a huge amount of annotated data. The hand pose datasets released so far present some issues that make them impossible ...
Journal
Journal title: Transactions of the Association for Computational Linguistics
Year: 2023
ISSN: 2307-387X
DOI: https://doi.org/10.1162/tacl_a_00592